ScreenIT Quality Assurance Results

Comparison between COVID preprint analyses with ScreenIT before and after update

The data prior to the update were taken from the latest database. Screenings with the updated version were done via a different API that sees only the PDFs and no metadata from the preprint servers. Two hundred preprints were selected for the comparison, but due to bugs in the pipeline, four preprints could not be screened with the updated pipeline, resulting in a total of 196 screened papers.

Sciscore Results

Compared to the previous version, the updated version more often returned “not required” for the ethics statement, for example for modeling papers. However, this also affected all downstream analyses, even in cases where there actually were statements related to, e.g., randomization or attrition (Figure 1). In addition, a couple of funding statements were incorrectly detected as ethics statements; attrition had a few false positives, blinding a few false negatives, and power analysis a couple of false positives and one false negative.

Figure 1: Comparison between Sciscore results in the previous and updated pipeline versions. Only cases where the two versions yielded incongruent results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

rtransparent Results

A full 100% of preprints in the data set had conflict of interest statements and funding statements; however, only some included these in the PDF of the manuscript (Figure 2). As the updated version screened only the PDF input, the manual assessment was also based only on the text in the PDF. The updated version of the pipeline missed COI statements if they had non-standard headings (e.g. “conflicts:”, or no section title at all) or if they were given on the first page of the manuscript.

Similarly, the updated version of the pipeline missed funding statements if they had non-standard headings (e.g. “financial disclosure”, “financing”, “funding/support”) or if the funding information was in the acknowledgements. Some statements were also missed when they appeared on the first page of the PDF (the screened text was missing the first page).

The updated version of the pipeline did not detect registration numbers in 16 cases where the previous version did. The majority (14) of these were correct calls, the exceptions being two cases where a PROSPERO registration number was cited in the preprint but missed by the updated version.

Figure 2: Comparison between rtransparent results in the previous and updated pipeline versions. Only cases where the two versions yielded incongruent results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

limitation-recognizer Results

There were only three discrepancies between the previous and updated pipeline versions (Figure 3). In all three, the updated version caught limitations that the previous version did not.

Figure 3: Comparison between limitation-recognizer results in the previous and updated pipeline versions. Only cases where the two versions yielded incongruent results are shown. Cases are also split by the result of the manual validation, with true positives shown as solid bars and true negatives shown as striped bars.

TrialIdentifier Results

The updated version of TrialIdentifier yielded several apparent false positives (Figure 4). Some of these were grant numbers or accession numbers given in supplemental tables.

Figure 4: Comparison between TrialIdentifier results in the previous and updated pipeline versions. Only cases where the two versions yielded incongruent results are shown.

False positive detection by TrialIdentifier for EUDRA 201501087122.

JetFighter Results

The updated JetFighter version detected seven papers that the previous version did not (Figure 5). In addition, it (possibly falsely) detected the fluorescence microscopy image shown in Figure 6.

Figure 5: Comparison between JetFighter results in the previous and updated pipeline versions. Only cases where the two versions yielded incongruent results are shown.

Figure 6: False positive detection by JetFighter.

Barzooka Results

In addition to the comparison of the previous and updated pipeline versions (for all tools listed above), we also compared the performance of Barzooka on two different types of input: the image files extracted individually during pipeline processing vs. a folder of PDFs of the same preprints. The main difference was thus the level of analysis: figure-based (Barzooka in the pipeline) vs. page-based (stand-alone Barzooka). Two hundred papers were screened with both Barzooka versions (pipeline and stand-alone), and the cases with discrepancies between the two versions were manually validated.
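The discrepancy screening described above can be sketched in a few lines. This is a minimal illustration, not the actual pipeline code: the function name, the representation of each run as a mapping from preprint ID to the set of detected graph types, and the toy IDs are all assumptions.

```python
# Hypothetical sketch of the discrepancy check between two Barzooka runs.
# Each run is assumed to be a dict mapping preprint ID -> set of detected
# graph types ("bar", "dot", "hist", ...); these names are illustrative only.

def find_discrepancies(pipeline_results, standalone_results):
    """Return, per preprint, the graph types on which the two runs disagree."""
    discrepancies = {}
    for paper_id in pipeline_results.keys() | standalone_results.keys():
        a = pipeline_results.get(paper_id, set())
        b = standalone_results.get(paper_id, set())
        diff = a ^ b  # symmetric difference: types flagged by only one run
        if diff:
            discrepancies[paper_id] = diff
    return discrepancies

# Toy example: only papers with at least one disagreement would be
# forwarded to manual validation.
pipeline = {"p1": {"bar", "dot"}, "p2": {"pie"}, "p3": {"hist"}}
standalone = {"p1": {"bar", "dot"}, "p2": {"pie", "approp"}, "p3": set()}
print(find_discrepancies(pipeline, standalone))
```

Papers where both runs agree on every category drop out entirely, which matches the report's approach of manually validating only the incongruent cases.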

Discrepancies between the two Barzooka versions on the presence or absence of a figure type were detected in 103 out of 200 papers, with discrepancies found for all figure types (Figure 7).

Figure 7: Comparison between Barzooka results from the stand-alone (yellow) and pipeline (purple) versions. Only cases where the two versions yielded incongruent results are shown. Cases are also split by the result of the manual validation with true positives shown as solid bars and true negatives shown as striped bars.

For most categories, especially “approp”, “bardot”, “dot”, and “pie”, the stand-alone version generally delivered better results (Figure 7). This is therefore the recommended way to use the tool; application to separately extracted image files should be avoided. For the stand-alone version, the occasional errors in the “bar” and “approp” categories occurred when proportional data were not recognized as such or when histograms or bardots were misclassified. The stand-alone version also detected “hist” images more readily than the pipeline version, although some of these detections were false positives. There were no false negatives and only a few false positives for the “dot” and “bardot” categories; commonly misidentified cases were dots with whiskers, scatter plots, or “bardots” with barely any bars visible. Similarly, many densely packed dot plots or box plots were mistaken for “violin” plots. Finally, several gene structure schematics and symbol-whisker plots with large squares were mistakenly classified as “box”.